18.6 Noise-Contrastive Estimation
Noise-contrastive estimation (NCE) estimates the probability distribution by

\(\log p_{model}(x) = \log \hat{p}_{model}(x; \theta) + c,\)

where c is explicitly introduced as an approximation of \(-\log Z(\theta)\). Rather than estimating only \(\theta\), the NCE procedure treats c as just another parameter and estimates \(\theta\) and c simultaneously, using the same algorithm for both. The resulting \(\log p_{model}(x)\) might not correspond exactly to a valid probability distribution, but it will become closer and closer to being valid as the estimate of c improves.
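As a minimal sketch of this parameterization, assume a 1-D Gaussian as the unnormalized model (the names and the Gaussian choice here are illustrative, not from the text): c is stored as just another parameter, standing in for \(-\log Z(\theta)\).

```python
import numpy as np

# Hypothetical unnormalized 1-D Gaussian model; c is a learned parameter
# approximating -log Z(theta). Here it is initialized at the true value
# for a unit Gaussian, -0.5 * log(2*pi), purely for illustration.
theta = {"mu": 0.0, "log_sigma": 0.0, "c": -0.5 * np.log(2 * np.pi)}

def log_p_model(x, theta):
    # log p_model(x) = log p_hat_model(x; theta) + c
    z = (x - theta["mu"]) / np.exp(theta["log_sigma"])
    log_unnormalized = -0.5 * z ** 2 - theta["log_sigma"]
    return log_unnormalized + theta["c"]
```

With c set to its true value, `log_p_model` recovers the exact standard-normal log-density; during NCE training, c would start at an arbitrary value and be updated by gradient descent alongside \(\theta\).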
NCE works by reducing the unsupervised learning problem of estimating p(x) to that of learning a probabilistic binary classifier in which one of the categories corresponds to the data generated by the model. Specifically, we introduce a noise distribution \(p_{noise}(x)\) that is tractable to evaluate and to sample from. We can now construct a model over both x and a new, binary class variable y. In the new joint model, \(p_{joint}(y=1) = \frac{1}{2}\), \(p_{joint}(x \mid y=1) = p_{model}(x)\), and \(p_{joint}(x \mid y=0) = p_{noise}(x)\).
y is a switch variable that determines whether we will generate x from the model or from the noise distribution.
We can construct a similar joint model of the training data. In this case, the switch variable determines whether we draw x from the data distribution or from the noise distribution.
We can now just use standard maximum likelihood learning on the supervised learning problem of fitting \(p_{joint}\) to \(p_{train}\):

\(\theta, c = \arg\max_{\theta, c} \mathbb{E}_{x, y \sim p_{train}} \log p_{joint}(y \mid x).\)
The distribution \(p_{joint}\) is essentially a logistic regression model applied to the difference in log probabilities of the model and the noise distribution:

\(p_{joint}(y=1 \mid x) = \sigma\big(\log p_{model}(x) - \log p_{noise}(x)\big).\)
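A small sketch of this classifier (the function names are illustrative; the two log-density arguments are assumed to be supplied by the model and by the noise distribution):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# The probability that x came from the data/model side of the switch is a
# logistic function of the log-density difference between model and noise.
def p_y1_given_x(log_p_model_x, log_p_noise_x):
    return sigmoid(log_p_model_x - log_p_noise_x)
```

When the model and the noise distribution assign equal density to x, the classifier is maximally uncertain and outputs 0.5; the more the model out-scores the noise on x, the closer the output moves to 1.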
NCE is simple to apply as long as
- \(\hat{p}_{model}\) is easy to back-propagate through
- \(p_{noise}(x)\) is easy to evaluate, in order to evaluate \(p_{joint}(y \mid x)\)
- \(p_{noise}(x)\) is easy to sample from to generate training data
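Putting these pieces together, here is a minimal NumPy sketch of one evaluation of the NCE objective; the noise sampler and the model's log-density are illustrative stand-ins, not prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_nce_batch(data, noise_sampler, k=1):
    # Label real data y=1 and k noise samples per data point y=0,
    # turning density estimation into supervised binary classification.
    noise = noise_sampler(k * len(data))
    x = np.concatenate([data, noise])
    y = np.concatenate([np.ones(len(data)), np.zeros(len(noise))])
    return x, y

def nce_loss(log_p_model_x, log_p_noise_x, y):
    # Numerically stable binary cross-entropy on
    # sigmoid(log p_model(x) - log p_noise(x)).
    logits = log_p_model_x - log_p_noise_x
    return np.mean(y * np.logaddexp(0.0, -logits)
                   + (1 - y) * np.logaddexp(0.0, logits))

# Example: standard-normal noise; for illustration, the model's
# log-density is a placeholder equal to the noise log-density.
data = rng.normal(size=8)
x, y = make_nce_batch(data, lambda n: rng.normal(size=n), k=1)
log_noise = -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)
loss = nce_loss(log_noise, log_noise, y)  # logits are all zero here
```

When the logits are zero everywhere, the loss is exactly \(\log 2 \approx 0.693\), the chance-level entropy of the classifier; training decreases the loss by pushing \(\log p_{model}\) up on data points and down on noise samples.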
When to use Noise Contrastive Estimation:
- NCE is most successful when applied to problems with few random variables, but it can work well even if those random variables can take on many values, e.g., modeling the conditional distribution over a word given the context of the word.
- It is less efficient when applied to problems with many random variables. The logistic regression classifier can reject a noise sample by identifying any one variable whose value is unlikely, which means that learning slows down greatly after \(p_{model}\) has learned the basic marginal statistics.
The constraint that \(p_{noise}\) must be easy to evaluate and easy to sample from can be overly restrictive. When \(p_{noise}\) is too simple, most samples are likely to be too obviously distinct from the data to force \(p_{model}\) to improve noticeably.
Like score matching and pseudolikelihood, NCE does not work if only a lower bound on \(\hat{p}\) is available.
When the model distribution is copied to define a new noise distribution before each gradient step, NCE defines a procedure called self-contrastive estimation, whose expected gradient is equivalent to the expected gradient of maximum likelihood.
- The special case of NCE where the noise samples are those generated by the model suggests that maximum likelihood can be interpreted as a procedure that forces a model to constantly learn to distinguish reality from its own evolving beliefs, while noise-contrastive estimation achieves some reduced computational cost by only forcing the model to distinguish reality from a fixed baseline (the noise model).